Read Graphical Data Analysis with R, Ch. 4, 5
Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class. Data manipulation should not be hard coded. That is, your scripts should be written to work for new data.
library(forwards)
library(ggplot2)
library(dplyr)
[18 points]
Data: useR2016 dataset in the forwards package (available on CRAN)
For parts (a) and (b):
ggplot(subset(useR2016, !is.na(Q20)), aes(Q20)) + geom_bar() + coord_flip()
ggplot(subset(useR2016, !is.na(Q11)), aes(Q11)) + geom_bar() + coord_flip()
d <- useR2016[!is.na(useR2016$Q11) & !is.na(useR2016$Q3), ] %>%
  group_by(Q11, Q3) %>%
  summarise(count = n()) %>%
  mutate(proportion = count / sum(count))
ggplot(d, aes(x=Q11, y=proportion, fill=Q3)) +
  geom_bar(stat="identity") +
  coord_flip()
d <- useR2016[!is.na(useR2016$Q11) & !is.na(useR2016$Q3) & !is.na(useR2016$Q2), ] %>%
  group_by(Q2, Q3, Q11) %>%
  summarise(count = n()) %>%
  mutate(proportion = count / sum(count))
ggplot(d, aes(x=Q3, y=proportion, fill=Q11)) +
  geom_bar(stat="identity") +
  facet_grid(Q2~.) +
  coord_flip()
It often helps to add line breaks (\n) to long tick mark labels. Write a function that takes a character string and a desired approximate line length in number of characters and substitutes a line break for the first space after every multiple of the specified line length.
add_line_breaks <- function(string, num){
  # Lazily match at least `num` characters, then replace the next space
  # with a newline. gsub() resumes after each match, so a break lands at
  # the first space after every multiple of `num` characters.
  gsub(paste0("(.{", num, ",}?) "), "\\1\n", string, perl = TRUE)
}
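A quick check of the function, plus a sketch of the Q13 - Q13_F part below. The example string is made up, the Q13:Q13_F column range is assumed to match the package's naming, and tidyr is used to stack the multiple-response columns:

```r
library(tidyr)

# Hypothetical long label, wrapped to roughly 15 characters per line
cat(add_line_breaks("Conferences and in-person training sessions", 15))

# Sketch: one bar per response across Q13 - Q13_F
useR2016 %>%
  select(Q13:Q13_F) %>%
  pivot_longer(everything(), values_to = "response") %>%
  filter(!is.na(response)) %>%
  count(response) %>%
  mutate(response = add_line_breaks(as.character(response), 25)) %>%
  ggplot(aes(x = reorder(response, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "count")
```

count() tallies one row per respondent-response pair, and reorder() sorts the bars by frequency, consistent with the dot-plot and bar-chart practices used elsewhere in this assignment.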
Q13 - Q13_F: Use your function from part (e) to add line breaks to the responses. Your graph should have one bar each for Q13 - Q13_F.
[18 points]
library(robotstxt)
library(rvest)
theme_dotplot<-theme_bw(16)+
theme(axis.text.y=element_text(size=rel(.75)),
axis.ticks.y=element_blank(),
axis.title.x=element_text(size=rel(.75)),
panel.grid.major.x=element_blank(),
panel.grid.major.y=element_line(size=0.5),
panel.grid.minor.x=element_blank())
To get the data for this problem, we’ll use the robotstxt package to check that it’s ok to scrape data from Rotten Tomatoes and then use the rvest package to get data from the web site.
Use the paths_allowed() function from robotstxt to make sure it's ok to scrape https://www.rottentomatoes.com/browse/box-office/. Then use rvest functions to find relative links to individual movies listed on this page. Finally, paste the base URL to each to create a character vector of URLs. Display the first six lines of the vector.
getUrls <- function(baseUrl)
{
  # Only scrape if robots.txt permits it; read the page that was
  # actually requested instead of a hard-coded URL
  if(paths_allowed(baseUrl)){
    read_html(baseUrl) %>%
      html_nodes("[target='_top']") %>%
      html_attr("href") %>%
      paste0("https://www.rottentomatoes.com", .)
  }
}
linkData <- getUrls("https://www.rottentomatoes.com/browse/box-office/")
print(linkData[1:6])
## [1] "https://www.rottentomatoes.com/m/abominable/"
## [2] "https://www.rottentomatoes.com/m/downton_abbey/"
## [3] "https://www.rottentomatoes.com/m/hustlers_2019/"
## [4] "https://www.rottentomatoes.com/m/it_chapter_two/"
## [5] "https://www.rottentomatoes.com/m/ad_astra/"
## [6] "https://www.rottentomatoes.com/m/rambo_last_blood/"
Use do.call() / rbind() / lapply() or dplyr::bind_rows() / purrr::map() to create a three-column data frame (or tibble). Display the first six lines of your data frame.
(Results will vary depending on when you pull the data.)
For help, see this SO post: https://stackoverflow.com/questions/36709184/build-data-frame-from-multiple-rvest-elements
library(stringr)
getDataFromLink <- function(url){
  # Return NULL on a failed request so bind_rows() simply skips it
  web <- tryCatch(read_html(url), error = function(e) NULL)
  if(is.null(web)) return(NULL)
  title <- web %>%
    html_nodes("[class='mop-ratings-wrap__title mop-ratings-wrap__title--top']") %>%
    html_text()
  if(length(title) == 0) return(NULL)
  score <- web %>%
    html_nodes("[class='mop-ratings-wrap__percentage']") %>%
    html_text() %>%
    str_extract_all("[0-9]+%")
  tomatometer <- NA_character_
  audience <- NA_character_
  if(length(score) >= 1) tomatometer <- score[[1]][1]
  if(length(score) == 2) audience <- score[[2]][1]
  tibble(title = title[1], tomatometer = tomatometer, audience = audience)
}
webData <- bind_rows(lapply(linkData, getDataFromLink))
webData
Write your data to file so you don’t need to scrape the site each time you need to access it.
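One way to cache the scraped data, as a sketch: the file name is arbitrary, and readr could be swapped for base write.csv()/read.csv(). It assumes webData from the scraping step above:

```r
library(readr)

# Save the scraped table once...
write_csv(webData, "rotten_tomatoes_scores.csv")

# ...then later sessions can reload it instead of re-scraping
webData <- read_csv("rotten_tomatoes_scores.csv")
```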
# Convert "97%"-style strings to numbers so the axis is continuous,
# and sort the movies by score (dot plot best practice)
webData <- mutate(webData, across(c(tomatometer, audience), ~ as.numeric(sub("%", "", .x))))
ggplot(webData, aes(tomatometer, reorder(title, tomatometer))) + geom_point() + theme_dotplot
ggplot(webData, aes(y = reorder(title, tomatometer))) + geom_point(aes(x = tomatometer, colour = "tomatometer score")) + geom_point(aes(x = audience, colour = "audience score")) + theme_dotplot
library(plotly)
f <- list(
family = "Courier New, monospace",
size = 18,
color = "#7f7f7f"
)
x <- list(
title = "tomatometer score",
titlefont = f
)
y <- list(
title = "audience score",
titlefont = f
)
webDataJulyUrls <- getUrls("https://www.rottentomatoes.com/browse/box-office/?rank_id=11&country=us")
webDataJuly <- bind_rows(lapply(webDataJulyUrls, getDataFromLink))
plot_ly(webDataJuly, x = ~as.numeric(sub("%", "", tomatometer)), y = ~as.numeric(sub("%", "", audience)), text=~title) %>%
layout(xaxis = x, yaxis = y) %>%
add_markers()
### 3. Weather
[14 points]
library(nycflights13)
Data: weather dataset in nycflights13 package (available on CRAN)
For parts (a) - (d) draw four plots of wind_dir vs. humid as indicated. For all, adjust parameters to the levels that provide the best views of the data.
g <- ggplot(weather, aes(x=wind_dir, y=humid))
g + geom_point(alpha=0.2, stroke=0)
g + geom_point(alpha=0.2, stroke=0) + geom_density_2d()
g + geom_hex(bins = 25)   # bin count adjusted by eye for a clearer view
g + geom_bin2d(bins = 25)
Describe noteworthy features of the data, using the “Movie ratings” example on page 82 (last page of Section 5.3) as a guide.
Draw a scatterplot of humid vs. temp. Why does the plot have diagonal lines?
ggplot(weather, aes(x=temp, y=humid)) + geom_point()
Draw a scatterplot matrix of the numeric variables in the weather dataset. Which pairs of variables are strongly positively associated and which are strongly negatively associated?
pairs(weather[5:15])
(h) Color the points by origin. Do any new patterns emerge?
pairs(weather[5:15], col=factor(weather$origin))